
Leveraging pre-trained language models


We fine-tune all the models on the dataset outlined in the preceding section for the task at hand. Notably, the methodology employed by [24] involved a two-step approach. In the first step, an error-prone silver training corpus was used to train a relation extraction algorithm [25]. Subsequently, the model trained in the initial step was combined with transfer learning techniques for fine-tuning on the GSC dataset. The authors demonstrated that this two-step transfer learning process yielded state-of-the-art (SOTA) results. Although this approach is interesting, we hypothesized that the costly training of deep-learning models could be circumvented by directly fine-tuning pre-trained models. In essence, our strategy is to take models previously trained on extensive generic or domain-specific free text as the foundation and subsequently fine-tune them for the targeted problem using a minimal quantity of high-quality training data (in our case, GSC).

Methodology

Figure 5 illustrates how evidence-question pairs are fed into the discriminative class of models, namely BERT [30], BioMegatron [33], PubMedBERT [32], BioClinicalBERT [28] and BioLinkBERT [34], for fine-tuning. We follow the standard procedure of tokenizing the evidence-question pair to produce token IDs and an attention mask. A maximum sequence length of 512 is maintained, as this is the limit for BERT-based models. For all base models considered, the fine-tuning process used a learning rate of \(5e-5\), a weight decay of 0.01, 7 epochs, and the Adam optimizer with a layer-wise learning rate decay of 0.9. All models were trained on an NVIDIA GeForce RTX 2080 Ti with 12GB of GPU memory, 16 CPU cores, and 64GB of system memory. Fine-tuning took about 30 minutes per model.

Fig. 5

Illustration of how the evidence and question tokens are mapped into our discriminative class of models

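For illustration, a minimal sketch of this fine-tuning setup using the Hugging Face transformers library is shown below. The checkpoint name, toy dataset, and batch size are assumptions for the sake of a self-contained example; the learning rate, weight decay, number of epochs, and maximum sequence length follow the values reported above, while the layer-wise learning rate decay would require custom parameter grouping and is omitted here.

from datasets import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

# Placeholder checkpoint; any of the BERT-style models listed above could be substituted.
checkpoint = "michiyasunaga/BioLinkBERT-base"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
# Four relation classes are assumed: negative, positive, relate, NA.
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=4)

# Toy stand-in for the GSC folds; the real data holds the full evidence-question pairs.
train_dataset = Dataset.from_dict({
    "evidence": ["Faecalibacterium prausnitzii ... negatively correlated with inflammatory bowel disease."],
    "question": ["What is the relationship between inflammatory bowel disease and Clostridiales ?"],
    "labels": [0],  # integer-encoded relation label
})

def encode(example):
    # Tokenize the evidence-question pair; sequences beyond 512 tokens are truncated.
    return tokenizer(example["evidence"], example["question"],
                     truncation=True, max_length=512, padding="max_length")

args = TrainingArguments(
    output_dir="relation-extraction",
    learning_rate=5e-5,
    weight_decay=0.01,
    num_train_epochs=7,
    per_device_train_batch_size=16,  # assumed; the batch size is not specified above
)

Trainer(model=model, args=args, train_dataset=train_dataset.map(encode)).train()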

Among the generative models, GPT-3 was fine-tuned through the OpenAI API using the GPT-3 davinci model. Details of the fine-tuning process are available on the OpenAI website (link: https://openai.com/api/). The inference parameters for the fine-tuned model are temperature=0.0, top_p=1, and max_tokens=1, with the remaining parameters left at their default settings. BioMedLM’s stanford-crfm/BioMedLM model was fine-tuned on an A40 GPU instance with DeepSpeed for efficiency, which kept GPU memory utilization at about 18GB. The model was trained for 20 epochs with batch_size=2, gradient_accumulation_steps=2, and learning_rate=2e-06, with the other parameters kept at their default settings. Completing the 20 epochs took around 7 hours. Similarly, BioGPT was fine-tuned on a single NVIDIA GeForce RTX 2080 Ti with 12GB memory, using parameters similar to those of the DDI (Drug-Drug Interaction) relation extraction experiment in BioGPT [28].
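As an illustration, inference against the fine-tuned GPT-3 model could look roughly as follows with the legacy (pre-1.0) openai Python client and its Completion endpoint; the API key, fine-tuned model identifier, and stop sequence are placeholders, while temperature, top_p, and max_tokens follow the values given above.

import openai

openai.api_key = "YOUR_API_KEY"  # placeholder

# Prompt built in the format described in the data-preparation section below.
prompt = (
    "Evidence: Additionally, some members of the phylum such as Faecalibacterium prausnitzii, "
    "a member of the Clostridiales-Ruminococcaceae lineage have been shown to have "
    "anti-inflammatory effects due to production of the short-chain fatty acid butyrate and "
    "have been negatively correlated with inflammatory bowel disease107.\n"
    " Question: What is the relationship between inflammatory bowel disease107 and Clostridiales ?"
    "\n\n####\n\n"
)

response = openai.Completion.create(
    model="davinci:ft-your-org:gsc-2023-01-01",  # placeholder fine-tuned model id
    prompt=prompt,
    temperature=0.0,
    top_p=1,
    max_tokens=1,
    stop=["\n"],  # assumed stop sequence; the exact value is not specified above
)
predicted_label = response["choices"][0]["text"].strip()
print(predicted_label)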

Data preparation for fine-tuning

This section provides details of the data processing and prompt design for the fine-tuning process of different models.

Discriminative models

The training data format for fine-tuning these models follows a simple structure. Each example consists of an input and an output. The input is represented by two strings, an Evidence string and a Question string, separated by a delimiter. For example, the Evidence string can be “Additionally, some members of the phylum such as Faecalibacterium prausnitzii, a member of the Clostridiales-Ruminococcaceae lineage have been shown to have anti-inflammatory effects due to the production of the short-chain fatty acid butyrate and have been negatively correlated with inflammatory bowel disease.” and the Question string “What is the relationship between inflammatory bowel disease and Clostridiales ?”. The output corresponds to the target label associated with the given input, as shown below.
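A single training record might therefore look like the minimal sketch below, assuming the model’s [SEP] token as the delimiter; the label shown is purely illustrative, as the actual gold label comes from the GSC annotation.

example = {
    # Evidence and question joined by the tokenizer's separator token
    "input": "Additionally, some members of the phylum ... negatively correlated with "
             "inflammatory bowel disease. [SEP] What is the relationship between "
             "inflammatory bowel disease and Clostridiales ?",
    "output": "negative",  # illustrative target label only
}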

This training data format allows for a straightforward mapping between the input evidence and question and the corresponding output label. By fine-tuning the pre-trained models on the GSC dataset encoded as above, they learn to effectively capture the relationship between the evidence and question and to produce accurate labels or predictions for the given input.

Generative models

The training data format for the GPT-3 model consists of a collection of examples, each represented by a prompt and a completion string that corresponds to the label.

The “prompt” key corresponds to the text that serves as the input or context for the model. It contains evidence related to the microbiome and disease relationship. In this format, the prompt text is structured as the evidence string followed by a question string, separated by a line break (“\n”). The evidence string provides the background or supporting information, while the question string represents the specific question to be answered by the model. For example, a prompt can be:

“Evidence: Additionally, some members of the phylum such as Faecalibacterium prausnitzii, a member of the Clostridiales-Ruminococcaceae lineage have been shown to have anti-inflammatory effects due to production of the short-chain fatty acid butyrate and have been negatively correlated with inflammatory bowel disease107.\n Question: What is the relationship between inflammatory bowel disease107 and Clostridiales ?\n\n####\n\n”. Here, the ending string “\n\n####\n\n” acts as a fixed separator for the model. For inference, prompts are constructed in the same format as the training dataset, including the same separator, and the same stop sequence is used to properly truncate the completion.
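A minimal sketch of assembling such prompt-completion records into a JSONL training file for the OpenAI fine-tuning workflow is given below; the helper function, label value, and file name are illustrative, and the leading space in the completion is a common GPT-3 fine-tuning convention rather than something mandated by the text above.

import json

SEPARATOR = "\n\n####\n\n"

def make_record(evidence, question, label):
    # Prompt = evidence + question + fixed separator; completion = the target label.
    prompt = f"Evidence: {evidence}\n Question: {question}{SEPARATOR}"
    return {"prompt": prompt, "completion": f" {label}"}

records = [
    make_record(
        "Additionally, some members of the phylum ... negatively correlated with "
        "inflammatory bowel disease107.",
        "What is the relationship between inflammatory bowel disease107 and Clostridiales ?",
        "negative",  # illustrative label only
    ),
]

with open("gsc_finetune.jsonl", "w") as f:
    for record in records:
        f.write(json.dumps(record) + "\n")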

The training data format allows multiple examples to be included, each following the same key-value structure. The detailed prompts for BioMedLM and BioGPT, which are very similar to the GPT-3 prompts, can be found on the supplementary website [see Additional file 1].

Table 2 Performance metrics of different fine-tuned language models

Results

To mitigate the risk of overfitting, model performance was evaluated using a 5-fold cross-validation strategy. The curated dataset was divided into five equal parts (folds). In each iteration, the models were fine-tuned on four folds and evaluated on the remaining fold. This process was repeated five times, with each fold serving as the test set once, and the average of the five test scores was taken as the final metric for each model. We assessed performance using several metrics, including accuracy, weighted-average F1 score, precision, and recall, which align with those reported in [24]. The detailed results are presented in Table 2. The results of the study by [24] are shown in the table using the notation \(BERE_{TL} (MDI)\); their reported F1 score reached a peak value of 0.738, with closely aligned precision and recall scores. These results serve as our baseline.
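The evaluation protocol can be sketched as follows with scikit-learn; fine_tune_and_predict is a hypothetical stand-in for the model-specific fine-tuning routines described above, and stratified splitting is an assumption, since the text only states that the folds are of equal size.

import numpy as np
from sklearn.metrics import accuracy_score, precision_recall_fscore_support
from sklearn.model_selection import StratifiedKFold

def cross_validate(texts, labels, fine_tune_and_predict, n_splits=5):
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=42)
    scores = []
    for train_idx, test_idx in skf.split(texts, labels):
        y_true = [labels[i] for i in test_idx]
        # Fine-tune on four folds, predict on the held-out fold.
        y_pred = fine_tune_and_predict(
            [texts[i] for i in train_idx],
            [labels[i] for i in train_idx],
            [texts[i] for i in test_idx],
        )
        precision, recall, f1, _ = precision_recall_fscore_support(
            y_true, y_pred, average="weighted", zero_division=0)
        scores.append((accuracy_score(y_true, y_pred), f1, precision, recall))
    # Mean and standard deviation of accuracy, weighted F1, precision, and recall.
    return np.mean(scores, axis=0), np.std(scores, axis=0)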

Fig. 6

Precision-recall curves for the various fine-tuned pre-trained models used in the discriminative setting


Turning to the performance of the fine-tuned discriminative models, we observed significant improvements across the entire spectrum. Notably, the model trained on BioLinkBERT-base yielded the best results, achieving an average F1 score of \(0.804 \pm 0.0362\) in the 5-fold cross-validation setup. Detailed information regarding all the models and the fine-tuning parameters can be found on our supplementary website (see Additional file 1). To further understand the characteristics of the classifiers, we plotted the precision-recall curves shown in Fig. 6. The area under the curve for the fine-tuned BioLinkBERT model was the highest at 0.85, indicating the best performance.
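As a sketch, precision-recall curves like those in Fig. 6 can be derived from the classifiers’ predicted class probabilities; a micro-averaged one-vs-rest binarization is assumed here, since the exact averaging scheme is not stated above.

import numpy as np
from sklearn.metrics import auc, precision_recall_curve
from sklearn.preprocessing import label_binarize

def micro_averaged_pr_curve(y_true, y_proba, classes):
    # y_true: list of class labels; y_proba: array of shape (n_samples, n_classes)
    # holding softmax probabilities from the fine-tuned classifier.
    y_bin = label_binarize(y_true, classes=classes)
    precision, recall, _ = precision_recall_curve(y_bin.ravel(), np.asarray(y_proba).ravel())
    return precision, recall, auc(recall, precision)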

Among the generative class of models, we found that the fine-tuned GPT-3 model yielded the best overall results, as shown in Table 2, although in terms of precision the model fine-tuned on BioMedLM also performed well (Table 2). However, we made a few observations regarding the use of these generative models. Firstly, we sometimes noticed variability in the results across runs, depending on the parameters used, and there were instances where the model produced empty outputs. Additionally, since these models are generative in nature, the outputs and probabilities they produce do not always align with well-defined class labels. This hinders our comprehension of how these models operate and raises concerns about the reliability of their outputs. Due to these limitations, we were unable to generate a precision-recall curve for GPT-3.

Table 3 Per-class metrics for the fine-tuned BioLinkBERT model

Table 4 Per-class metrics for the fine-tuned GPT-3 model

To gain deeper insight into the behavior of the classifiers, we analyzed the per-class performance metrics for both the fine-tuned generative model (GPT-3) and the fine-tuned discriminative model (BioLinkBERT). As expected, the metrics for the negative, positive, and relate classes were satisfactory. However, we observed poor performance on the NA class for both the fine-tuned GPT-3 model (refer to Table 4) and the BioLinkBERT model (refer to Table 3). This deficiency also accounts for the lower overall classification performance. There are two possible reasons for this outcome. Firstly, as previously discussed, the distribution of the dataset is imbalanced, with a smaller number of NA samples. Secondly, there may be inherent challenges in defining the classes in the original problem, which could necessitate further investigation and deliberation. However, exploring these concerns is beyond the scope of this paper.
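The per-class numbers in Tables 3 and 4 can be obtained with a standard classification report, as in the sketch below; the class-name ordering is an assumption about how the labels are integer-encoded.

from sklearn.metrics import classification_report

CLASS_NAMES = ["negative", "positive", "relate", "NA"]  # assumed label order

def per_class_report(y_true, y_pred):
    # Returns a text report with precision, recall, and F1 for each class, plus averages.
    return classification_report(y_true, y_pred, labels=list(range(len(CLASS_NAMES))),
                                 target_names=CLASS_NAMES, zero_division=0)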

Fig. 7

Comparing the MDIDB knowledge base generated using the BERE (TL) model versus our prediction model

Comparison of outputs of our approach using a web-based solution

In the study [24], the authors applied their best-performing model (\(BERE_{TL}\)) to a large corpus of text to extract disease-microbiome relationships, which they subsequently released as the MDIDB database.

We aimed to compare the outputs generated by our model with those in the MDIDB database. To accomplish this, we devised a straightforward graph and visualization strategy, as illustrated in the first panel of Fig. 7. The process involved running both models on the original set of evidence statements used in MDIDB and comparing the resulting graphs. In our graph representation, nodes correspond to diseases or microbes, while edges represent established relationships between them. Nodes are colored green when both algorithms agree on the nature of the relationship and red when they disagree. We also developed a web application for this purpose, which is accessible on our supplementary website [see Additional file 1]. The user interface allows users to select the number of edges they wish to visualize from the larger graph. After specifying, for example, 50 edges in the provided text box, users can click the “generate knowledge graph” button to display the corresponding knowledge graph. Zooming in and hovering over the edges of the graph reveals the differences in predictions between the two models, including the evidence text, for both red and green nodes (as depicted in Fig. 7). This approach aims to give expert researchers a more comprehensive understanding of the performance of the different models.
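A simplified sketch of how such a comparison graph can be assembled with the networkx library is shown below; the record fields and function name are illustrative, and the green/red coloring follows the agreement rule described above.

import networkx as nx

def build_comparison_graph(records):
    """records: iterable of dicts with keys 'microbe', 'disease', 'evidence',
    'bere_label' (MDIDB/BERE prediction), and 'our_label' (our model's prediction)."""
    graph = nx.Graph()
    for rec in records:
        agree = rec["bere_label"] == rec["our_label"]
        color = "green" if agree else "red"
        graph.add_node(rec["microbe"], kind="microbe", color=color)
        graph.add_node(rec["disease"], kind="disease", color=color)
        # Keep both predictions and the evidence text so they can be shown on hover.
        graph.add_edge(rec["microbe"], rec["disease"],
                       bere=rec["bere_label"], ours=rec["our_label"],
                       evidence=rec["evidence"])
    return graph

# Example: limit the rendered view to a user-selected number of edges (e.g. 50), as in the web tool.
# graph = build_comparison_graph(records)
# first_50_edges = list(graph.edges(data=True))[:50]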


